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Abstract 

Long-range and highly accurate de novo assembly from short-read data is one of the most 
pressing challenges in genomics. Recently, it has been shown that read pairs generated by 
proximity ligation of DNA in chromatin of living tissue can address this problem. These data 
dramatically increase the scaffold contiguity of assemblies and provide haplotype phasing in¬ 
formation. Here, we describe a simpler approach (“Chicago”) based on in vitro reconstituted 
chromatin. We generated two Chicago datasets with human DNA and used a new software 
pipeline (“HiRise”) to construct a highly accurate de novo assembly and scaffolding of a hu¬ 
man genome with scaffold N50 of 30 Mbp. We also demonstrated the utility of Chicago for 
improving existing assemblies by re-assembling and scaffolding the genome of the American 
alligator. With a single library and one lane of lllumina HiSeq sequencing, we increased the 
scaffold N50 of the American alligator from 508 Kbp to 10 Mbp. Our method uses established 
molecular biology procedures and can be used to analyze any genome, as it requires only 
about 5 micrograms of DNA as the starting material. 
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A “holy grail” of genomics is the accurate reconstruction of full-length haplotype-resolved chro¬ 
mosome sequences with low effort and cost. High-throughput sequencing methods have sparked 
a revolution in the field of genomics. By generating data from millions of short fragments of DMA at 
once, the cost of re-sequencing genomes has fallen dramatically, rapidly approaching $1,000 per 
human genome (Sheridan, 2014). Substantial obstacles remain, however, in transforming short 
read sequences into long, contiguous genomic assemblies. 

Currently accessible and affordable high-throughput sequencing methods are best suited to 
the characterization of short-range sequence contiguity and genomic variation. Achieving long- 
range linkage and haplotype phasing requires either the ability to directly and accurately read long 
(i.e., tens of kilobase) sequences, or the capture of linkage and phase relationships through paired 
or grouped sequence reads. 

A number of methods for increasing the contiguity and accuracy of de novo assemblies have 
recently been developed. Broadly, they either attempt to increase the read lengths generated from 
sequencing or increase the insert size between paired short reads. For example, the PacBio RS 
II can produce raw reads up to 23 Kbp (median 2 Kbp) in length but suffers from error rates as 
high as -15% and remains -100-fold more expensive than high-throughput short reads (Keren 
et al., 2012; Quail et al., 2012). Commercially-available long-reads from Oxford Nanopore are 
promising but have even higher error rates and lower throughput(Goodwin et al., 2015). Illumina’s 
TruSeq Synthetic Long-Read technology (formerly Molecule) is limited to 10 Kbp reads maxi¬ 
mum (Voskoboynik et al., 2013). CPT-seq is somewhat similar in approach but does not rely on 
long-range PCR amplification (Adey et al., 2014; Amini et al., 2014). Despite a number of improve¬ 
ments, fosmid library creation (Williams et al., 2012; Wu et al., 2012), remains time-consuming and 
expensive. To date, the community has not settled on a consistently superior technology for large 
inserts or long reads that is available at the scale and cost needed for large-scale projects like the 
sequencing of thousands of vertebrate species(Haussler et al., 2009) or hundreds of thousands 
of humans (Torjesen, 2013). 

The challenge of creating reference-quality assemblies from low-cost sequence data is evident 
in the comparison of the quality of assemblies generated with today’s technologies and the human 
reference assembly (Alkan et al., 2011). Many techniques including BAG clone sequencing, phys¬ 
ical maps, and Sanger sequencing were used to create the high quality and highly contiguous 
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human reference standard with an 38.5 Mbp N50 length and error rate of 1 per 100,000 bases 
(International Human Genome Sequencing Consortium, 2004). In contrast a recent comparison 
of the performance of whole genome shotgun (WGS) assembly software pipelines, each run by 
their developers on very high coverage data sets from libraries with multiple insert sizes, produced 
assemblies with N50 scaffold length ranging up to 4.5 Mbp on a fish genome and 4.0 Mbp on a 
snake genome (Bradnam et al., 2013). 

High coverage of sequence with short reads is rarely enough to attain a high-quality and highly 
contiguous assembly. This is due primarily to repetitive content on both large and small scales, 
including the repetitive structure near centromeres and telomeres, large paralogous gene families 
like zinc finger genes, and the distribution of interspersed nuclear elements such as LINEs and 
SINEs. Such difticult-to-assemble content composes large portions of many eukaryotic genomes, 
e.g. 60-70% of the human genome (de Koning et al., 2011). When such repeats cannot be 
spanned by the input sequence data, fragmented and incorrect assemblies result. In general, the 
starting point for de novo assembly combines deep coverage (50X-200X minimum), short-range 
(300-500 bp) paired-end “shotgun” data with intermediate range “mate-pair” libraries with insert 
sizes between 2 and 8 Kbp, and longer range (35 Kbp) fosmid end pairs (Gnerre et al., 2011; 
Salzberg et al., 2012). Even this is not completely adequate. 

Recently, high-throughput short-read sequencing has been used to characterize the three- 
dimensional structure of chromosomes in living cells. Proximity ligation based methods like Hi-C 
(Lieberman-Aiden et al., 2009) and other chromatin capture-based methods (Kalhor et al., 2012; 
Dixon et al., 2012) rely on the fact that, after fixation, segments of DMA in close proximity in the nu¬ 
cleus are more likely to be ligated together, and thus sequenced as pairs, than are distant regions. 
As a result, the number of read pairs between intra-chromosomal regions is a slowly decreasing 
function of the genomic distance between them. Several approaches have been developed that 
exploit this information for the purpose of genome assembly scaffolding and haplotype phasing. 
(Burton et al., 2013; Selvaraj et al., 2013; Kaplan and Dekker, 2013; Marie-Nelly et al., 2014) 

While Hi-C and related methods can identify long-range chromatin contacts and other bio¬ 
logical features of chromosomes at multi-megabase length scales, the principal signal useful for 
genome assembly and phasing is derived from DNA-DNA contacts on the scale of tens or hun¬ 
dreds of kilobases. These contacts arise from the polymer physics of the nucleosome-wound DNA 
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fiber, rather than from chromatin biology. In fact, the large-scale organization of chromosomes in 
nuclei provides a confounding signal for assembly since, for example, telomeres of different chro¬ 
mosomes are often associated in cells. 

We demonstrate here that DNA linkages up to several hundred kilobases can be produced in 
vitro using reconstituted chromatin rather than living chromosomes as the substrate for the produc¬ 
tion of proximity ligation libraries. The resulting libraries share many of the characteristics of Hi-C 
data that are useful for long-range genome assembly and phasing, including a regular relationship 
between within-read-pair distance and read count. Combining this in vitro long-range mate-pair 
library with standard whole genome shotgun and jumping libraries, we generated a de novo hu¬ 
man genome assembly with long-range accuracy and contiguity comparable to more expensive 
methods, for a fraction of the cost and effort. This method, called “Chicago” (Cell-free Hi-C for 
Assembly and Genome Organization), depends only on the availability of modest amounts of high 
molecular weight DNA, and is generally applicable to any species. Here we demonstrate the value 
of this Chicago data not only for de novo genome assembly using human and alligator, but also 
as an efficient tool for the identification of structural variations and the phasing of heterozygous 
variants. 


Results 

Libraries and sequencing 

We extracted 5.5ng of high molecular weight DNA for each Chicago library (in fragments of ap¬ 
proximately 150 Kbp) from the human cell line GM12878 and from the blood of a wild-caught 
American alligator. We reconstituted chromatin by combining the DNA with purified histones and 
chromatin assembly factors. The reconstituted chromatin was then fixed with formaldehyde and 
Chicago libraries were generated (Figure 1 and Methods). 

For the human GM12878 sample we generated two Chicago libraries using the restriction en¬ 
zymes Mbol and MluCI, which generate 4 bp 5’ overhangs. These barcoded libraries were pooled 
and sequenced on a single lllumina HiSeq 2500 lane in paired lOObp reads, generating 46M Mbol 
and 52M MluCI library read pairs. For comparison, a third library was prepared from nominally 50 
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Figure 1: A diagram of a Chicago library generation protocol. A) Chromatin (nucleosomes in blue) 
is reconstituted in vitro upon naked DNA (black strand) B) Chromatin is fixed with formaldehyde 
(thin, red lines are crosslinks). C) Fixed chromatin is cut with a restriction enzyme, generating 
free sticky ends (performed on streptavidin-coated beads, not shown) D) Sticky ends are filled in 
with biotinylated (blue circles) and thiolated (green squares) nucleotides. E) Free blunt ends are 
ligated (ligations indicated by red asterisks). F) Crosslinks are reversed and proteins removed to 
yield library fragments. 
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Figure 2: Read pair separations for several sequencing libraries mapped to hg19. Green: 50 Kbp 
input human Chicago library. Orange: 150 Kbp input human Chicago library. Light blue: 150 
Kbp input human Chicago library. Dark blue: A human Hi-C library(Kalhor et al., 2012). Dark 
vertical lines indicate maximum advertised or demonstrated capabilities for alternative mate-pair 
technologies. 



Kbp DNA (Figure 2). For the American alligator {Alligator mississipiensis) we constructed a single 
Mbol Chicago library and sequenced it on a single lane, yielding 132M read pairs. To determine 
the utility of these data for genome assembly and haplotype phasing, we aligned the GM12878 
Chicago data to the reference human assembly (hg19) (Figure 2). The Chicago libraries provided 
useful linking information up to separations of 150 Kbp on the genome with a background noise 
rate of approximately one spurious link between unrelated 500 Kbp genomic windows (mean of 
0.97 such links). The single lane of sequence from the GM12878 libraries provided linking in¬ 
formation equivalent to 3.8X, 8.4X, 8.6X, 18.6X, 13.5X, 6.5X physical coverage in 0-1, 1-5, 5-10, 
10-25, 25-50, and 50-200 Kbp libraries respectively, while for aligator the comparable coverage 
estimates were 5.4X, 16.7X, 16.7X, 42.2X, 36.1 X, 16.5X respectively (Figure 3). 

Chicago data for genome scaffolding 

To determine the power and utility of data extracted from Chicago libraries, we focused on genome 
assembly and scaffolding using only generic 300-500 bp insert lllumina shotgun libraries and 
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Figure 3: Genome coverage (sum of read pair separations divided by estimated genome size) in 
various read pair separation bins. 
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the Chicago libraries described above. We also tested the benefit of adding Chicago data to 
datasets with a broader range of insert sizes. We initially assembled a previously-described 
84x, 101 bp paired-end lllumina shotgun data set from GM12878 (Simpson and Durbin, 2012) 
using MERACULOUS (Chapman et al., 2011) into scaffolds with typical (N50) size of 33 Kbp. 
We mapped the Chicago read pairs to this initial assembly as described in Methods. We found 
that 68.9% of read pairs mapped such that both forward and reverse reads had map-quality 
>= 20 and were thus considered uniquely mapping within the assembly and were not dupli¬ 
cates. 26.8% of these read pairs had forward and reverse reads that mapped to different con- 
tigs and were thus potentially informative for further scaffolding of the assembly. We also used 
the same Chicago data to scaffold a Discovar assembly of SOX coverage in paired 250bp reads 
(ftp: //ftp.broadinstitute. org/pub/crd/Discovar/assemblies) (Sharpe et al., 2015). 

We developed a likelihood model describing how Chicago libraries sample genomic DNA, and 
a software pipeline called “HiRise” for breaking and re-scaffolding contigs based on Chicago links 
(Methods). We compared the completeness, contiguity and correctness at local and global scales 
of the resulting assembly to assemblies of rich WGS datasets, including extensive coverage in 
fosmid end pairs created by two of the leading WGS assemblers: MERACULOUS (Chapman 
et al., 2011) and ALLPATHS-LG (APLG) (Gnerre et al., 2011) (Table 1). To avoid the arbitrary 
choices involved in constructing alignment-based comparisons of assembly quality, we based our 
comparison on the assembled positions of 25.4 million 101 bp sequences that are a randomly- 
selected subset of all distinct 101 bp sequences that occur exactly once in each haplotype of the 
diploid NA12878 assembly(Rozowsky et al., 2011). 

Long range scaffolding accuracy 

The genomic scaffolds that the HiRise pipeline produced were longer and had a lower rate of 
global mis-assemblies than the published MERACULOUS and APLG assemblies, both of which 
rely on deep coverage in paired fosmid end reads. Table 1 shows the fraction of the total assembly 
found in scaffolds containing a misjoin, where a misjoined scaffold is defined as having a stretch 
of unique 101-mers spanning at least 5 Kbp, 10 Kbp or 50 Kbp from more than one chromosome 
in the diploid reference. The table also shows measures of completeness and contiguity for four 


8 



successive rounds of HiRise assembly compared to other assemblies of NA12878. 


Table 1: Scaffolding results. Fraction of each assembly in scaffolds containing a misjoin at three 
different thresholds for identifying misjoins. Scaffold N50. 50 Kbp separation discrepancy 95% 
confidence interval (95% Cl = x means: Given a pair of unique 101-mer tags in the assembly, 
95% of them are within 50 Kbp ± x of each other in the reference.) completeness (%C); fraction of 
lOlmers misoriented. 



5 Kbp 

Misjoins 
10 Kbp 

50 Kbp 

N50 

(Mbp) 

95% Cl 

50 kbp A 

%C 

% mis- 
oriented 

MERACULOUS 

0.032 

0.030 

0.022 

9.1 

1.3 Kbp 

94.8 

0.09 

APLG 

0.245 

0.187 

0.130 

12.1 

6.4 Kbp 

92.2 

0.4 

MERAC (33Kbp N50) + HiRise: 

0.028 

0.011 

0.009 

12.6 

7.7 Kbp 

94.1 

1.3 

MERAC (33Kbp N50) + HiRiseO.9.8 

0.052 

0.029 

0.014 

14.9 

7.6 Kbp 

94.1 

1.2 

Discover (178Kbp N50) -i- HiRise: 

0.102 

0.097 

0.076 

29.9 

3.2 Kbp 

97.9 

1.4 


Because the DNA ligation events that create Chicago pairs are not constrained to produce 
read pairs of defined relative strandedness, contig relative orientations during scaffolding must be 
inferred from read density information. As a result, the Chicago HiRise scaffolds have a higher 
rate of mis-oriented 101-mers (1.3%) than is found in the other assemblies, most occurring in 
small contigs. The median size of contigs containing mis-oriented 101-mers was 2.1 Kbp. 


Improving the alligator assembly with Chicago data 

To further assess the utility of Chicago data for improving existing assemblies, we generated a 
single Chicago library for the American alligator Alligator mississippiensis and sequenced 210.7 
million reads on the lllumina HiSeq 2500. We mapped these data to a de novo assembly (N50 81 
Kbp) created using publicly available data (Green et al., 2014) and applied the HiRise scaffolding 
pipeline. The resulting assembly had a scaffold N50 of 10.3 Mbp. To assess the accuracy of these 
scaffolds, we aligned a collection of 1,485 previously generated (Shedlock et al., 2007) bacterial 
artificial chromosome (BAC) end sequences to the assembly. Of those 1,298 pairs were uniquely 
aligned by GMAP (Wu and Watanabe, 2005) with 90% coverage and 95% identity to the genome 
assembly and the HiRise scaffolded version. In the input assembly, 12.5% of BAC end pairs 
were captured in the same scaffold with the expected orientation and separation. In the HiRise 
assembly 96.5% of BAC end pairs were aligned in the same scaffold with 98.1% of BAC end pairs 
on same scaffold in correct relative orientation. 5 (0.39%) BAC end pairs were placed on the 
same scaffold but at a distance significantly larger than insert size and 14 (1.08%) were placed on 
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separate scaffolds but far enough from edge of scaffold that distance would be larger than insert 
size suggesting a global density of misjoins of less than 1 per 8.36 Mbp of assembly. 

Phase accuracy of Chicago read pairs 

As shown for Hi-C data (Selvaraj et al., 2013), read pairs formed by proximity ligation are nearly 
always the product of ligation between a single contiguous DNA strand. Thus, read pairs where 
both the forward and reverse read cover a heterozygous site can be used to directly read haplo- 
type phase as is done using fosmid or other mate-pair library data. Because the distance covered 
in Chicago read pairs can be as great as the size of the input DNA, we assessed the accuracy of 
phase information and its utility for determining haplotype phase in the GM12878 sample. Impor¬ 
tantly, because GM12878 derives from an individual that has been trio-sequenced, gold-standard 
haplotype phase information is available to check the accuracy of Chicago phasing information. 
Read pairs that are haplotype informative and that span between 10 Kbp and 150 Kbp are 99.83% 
in agreement with the known haplotype phase for GM12878. 

Identification of structurai variants 

Mapping paired sequence reads from one individual against a reference is the most commonly 
used sequence-based method for identifying differences in genome structure like inversions, dele¬ 
tions and dupications (Tuzun et al., 2005). Figure 4 shows how Chicago read pairs from GM12878 
mapped to the human reference genome GRCh38 reveal two such structural differences. To esti¬ 
mate the sensitivity and specificity of Chicago data for identifying structural differences, we tested 
a simple maximum likelihood discriminator (Methods) on simulated data sets constructed to sim¬ 
ulate the effect of heterozygous inversions. We constructed the test data by randomly selecting 
intervals of a defined length L from the mapping of our Chicago NA12878 reads to the GRCh38 ref¬ 
erence sequence and assigning each Chicago read pair independently at random to the inverted 
or reference haplotype, and editing the mapped coordinates accordingly. Non-allelic homologous 
recombination is responsible for much of the structural variation observed in human genomes, 
resulting in many variation breakpoints that occur in long blocks of repeated sequence(Kidd et al., 
2008). We simulated the effect of varying lengths of repetitive sequence surrounding the inversion 
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Figure 4: The mapped locations on the GRCh38 reference sequence of Chicago read pairs are 
plotted in the vicinity of structural differences between GM12878 and the reference. Each Chicago 
pair is represented both above and below the diagonal. Above the diagonal, color indicates map 
quality score on scale shown; below the diagonal colors indicate the inferred haplotype phase of 
Chicago pairs based on overlap with a phased SNPs. 
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breakpoints by removing all reads mapped to within a distance W of them. In the absence of repet¬ 
itive sequences at the inversion breakpoints, we found that for 1 Kbp, 2 Kbp and 5 Kbp inversions 
respectively, the sensitivities (specificities) were 0.76 (0.88), 0.89 (0.89) and 0.97 (0.94) respec¬ 
tively. Simulating 1 Kbp regions of repetitive (unmappable) sequence at the inversion breakpoints, 
the sensitivity (specificity) for 5 Kbp inversions was 0.81 (0.76). 

Discussion 

We have described an /n vitro method for generating long range mate-pair data that can dramat¬ 
ically improve the scaffolding of de novo assembled genomes from high-throughput sequencing 
data. This approach has several advantages over existing methods. First, Chicago library con¬ 
struction requires no living biological material, i.e., no primary or transformed tissue culture or 
living organism. The libraries described here were each generated from 5.0 micrograms of input 
DNA. Furthermore, although the in vitro chromatin reconstitution is based on human histones and 
chromatin assembly factors, DNA from a wide variety of plants, animals, and microbes can be 
substrate for in vitro chromatin assembly using the protocol described (data not shown). Sec¬ 
ond, because Chicago data are generated from proximity ligation of chromatin assembled in vitro 
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rather than chromatin obtained from in vivo sources there is no confounding biological signal (e.g., 
telomeric clustering, or chromatin looping) to potentially confuse the assembly. Hi-C and or other 
proximity ligation data generated from in vivo chromatin carries within it long-range proximity in¬ 
formation that is of biological relevance, but is persistent and potentially confounding for genome 
assembly. As expected for in vitro assembled chromatin, we find a low background rate of noise 
and a virtual absence of persistent and spurious read pairs. Third, in contrast to in vivo Hi-C 
methods, the maximum separation of the read pairs generated is limited only be the molecular 
weight of the input DMA. This has allowed us to generate highly contiguous scaffolding of verte¬ 
brate genomes using just short fragment lllumina sequence plus Chicago libraries. To date, high 
quality scaffolding based on in vivo Hi-C libraries have started from assemblies with an order of 
magnitude more scaffold contiguity than the 30 Kbp N50 input contigs successfully scaffolded by 
Chicago HiRise. Fourth, these libraries eliminate the need for creating and sequencing a com¬ 
bination of long range “mate-pair” and fosmid libraries, and do not require the use of expensive, 
specialized equipment for shearing or size-selecting high molecular weight DMA normally required 
to create such libraries. In summary, we have presented a simple DMA library construction and 
associated bioinformatic methods that generate significantly longer-range genome assembly scaf¬ 
folds than existing methods. Furthermore, we have demonstrated the usefulness of our data for 
the discovery of genome variation. Our methods and results mark a substantial step toward the 
goal of accurate reconstruction of full-length haplotype-resolved chromosome sequences with low 
effort and cost. 
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Methods 


DNA Preparation 

DNA was extracted with Qiagen Blood and Cell Midi kits according to the manufacturer’s instruc¬ 
tions. Briefly, cells were lysed, and centrifuged to isolate the nuclei. The nuclei were further 
digested with a combination of Proteinase K and RNAse A. The DNA was bound to a Qiagen 
genomic column, washed, eluted and precipitated in isopropanol and pelleted by centrifugation. 
After drying, the pellet was resuspended in 200 TE (Qiagen). 

Chromatin assembly 

Chromatin was assembled overnight at 21 from genomic DNA using the Active Motif in vitro 
Chromatin Assembly kit. Following incubation, 10% of the sample was used for MNase digestion 
to confirm successful chromatin assembly. 

Biotinylation and restriction digestion 

Chromatin was biotinylated with iodoacetyl-PEG-2-biotin (IPB). Following biotinylation, the chro¬ 
matin was fixed in 1 % formaldehyde at room temperature (RT) for 15 minutes, followed by a quench 
with 2-fold molar excess of 2.5M Glycine. Excess IPB and cross-linked glycine were removed by 
dialyzing chromatin in a Slide-A-Lyzer 20 KDa MWCQ dialysis cassette (Pierce) against 1L of dial¬ 
ysis buffer (10mm Tris-CI, pH8.0, ImM EDTA) at 4°C for a minimum of 3 hours. Subsequently, 
the chromatin was digested with either Mbol or MluCI in IX CutSmart for 4 hours at 37Ti;. The 
chromatin was again dialyzed in a 50 KDa MWCQ dialysis Flex tube (IBI Scientific # IB48262) at 
4'Q for 2 hours, then again with fresh buffer overnight, to remove enzyme as well as short, free 
DNA fragments. 

Dynabead MyQne Cl streptavidin beads were prepared by washing and resuspending in PBS 
-I- 0.1% Tween-20, before adding to chromatin and incubating for 1 hour at RT. The beads were 
then concentrated on a magnetic concentrator rack, before being washed, re-concentrated, and 
resuspended in 100 IX NEBuffer 2. 
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dNTP fill-in 


To prevent the labeled dNTP’s (Figure 1) from being captured during the fill-in reaction, un-bound 
streptavidin sites were occupied by incubating beads in the presence of free biotin for 15 minutes at 
RT. Subsequently, the beads were washed twice before being resuspended in 100 fiL IX NEBuffer 
2 . 

Sticky ends were filled in by incubating with dNTPs, including a-S-dGTP and biotinylated dCTP 
along with 25 U of Klenow (#M0210M, NEB) in 165 total volume at 25°C for 40 minutes. The 
fill-in reaction was stopped by adding 7 of 0.5M EDTA. The beads were then washed twice in 
Pre-ligation wash buffer (PLWB: 50mM Tris 7.4; 0.4% Triton X-100; 0.1 mM EDTA), before being 
resuspended in 100/xL PLWB. 

Ligation 

Ligation was performed in at least ImL of T4 ligation buffer at 16°C for a minimum of 4 hours. A 
large ligation volume was used to minimize cross-ligation between different chromatin aggregates. 
The ligation reaction was stopped by adding 40fj.L of 0.5M EDTA. The beads were concentrated 
and resuspended in 100/iL extraction buffer (50mM Tris-cl pH 8.0, ImM EDTA, 0.2% SDS). After 
adding 400ug Proteinase K (#P8102S, NEB) the beads were incubated overnight at 55°C, followed 
by a 2 hour digestion with an additional 200 ug Proteinase K at 55°C. DNA was recovered with 
either SPRI beads at a 2:1 ratio, a column purification kit, or with a phenol:chloroform extraction. 
DNA was eluted into Low TE (10 mM Tris-CI pH 8.0, 0.5 mM EDTA). 

Exonuclease digestion 

DNA was next digested for 40 minutes at 37°C with 100 U Exonuclease III (#M0206S, NEB) to 
remove biotinylated free ends, followed by SPRI cleanup and elution into 101 /iL low TE 

Shearing and Library Prep 

DNA was sheared using a Diagenode Bioruptor set to ’Low’ for 60 cycles of 30 seconds on / 30 
seconds off. After shearing, the DNA was filled in with Klenow polymerase and T4 PNK (#EK0032, 


14 


Thermo Scientific) at 20^1; for 30 minutes. Following the fill-in reaction, DNA was pulled down 
on C1 beads that had been prepared by washing twice with Tween wash buffer, before being 
resuspended in 200 2X NTB (2M NaCI, lOmM Tris pH 8.0, 0.1 mM EDTA pH 8.0, 0.2% Triton 
X-100). Once the sample was added, the beads were incubated at room temperature for 20 
minutes with rocking. Subsequently, unbiotinylated DNA fragments were removed by washing the 
beads three times before resuspending in Low TE. Sequencing libraries were generated using 
established protocols (Meyer and Kircher, 2010). 

Read mapping 

Sequence reads were truncated whenever a junction was present (GATCGATC for Mbol, AAT- 
TAATT for MluCI). Reads were then aligned using SMALT [http://www.sanger.ac.uk/resources/ 
software/smait/] with the -X option to independently align forward and reverse reads. PCR du¬ 
plicates were marked using Picard-tools MarkDuplicates [http://broadinstitute.github.io/ 
picard/]. Non-duplicate read pairs were used in analysis if both reads mapped and had a map¬ 
ping quality greater than 10. 

De novo assemblies 

The human and alligator de novo shotgun assemblies were generated with Meraculous 2.0.3 
(Chapman et al., 2011) using publicly available short-insert and mate-pair reads (Simpson and 
Durbin, 2012; Green et al., 2014). The alligator mate-pair reads were adapter-trimmed with Trim- 
momatic (Bolger et al., 2014). Some overlapping alligator short-insert reads had been “merged.” 
These were unmerged back into forward and reverse reads. 

Chicago HighRise (HiRISE) Scaffolder 
Input pre-processing 

To exclude Chicago reads that map to highly repetitive genomic regions likely to provide misleading 
links, we used the depth of aligned shotgun reads to identify problematic intervals. We used a 
double threshold strategy: identify all intervals of the starting assembly with mapped shotgun read 
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depth exceeding ti that contain at least one base with a mapped read depth exceeding t 2 . In 
practice we set ti and t 2 such that about 0.5% of the assembly was masked. We also excluded all 
Chicago links falling within a 1 Kbp window on the genome which is linked to more than four other 
input contigs by at least two Chicago links. 

Estimation of iikeiihood modei parameters 

Several steps of the HiRise pipeline use a likelihood model of the Chicago data to guide assem¬ 
bly decisions or to optimize contig order and orientation within scaffolds. The likelihood function 
L{11,12, g, o) = nr=i /('^*) gives the probability of observing the number n and 

implied separations of spanning Chicago pairs di betwen contigs 1 and 2, assuming the con- 
tigs have relative orientations o g —,—h,— and are separated by a gap of length $g.$ 
The function f{x) is the normalized probability distribution over genomic separation distances of 
Chicago read pairs, and is assumed to have a contribution from “noise” pairs which sample the 
genome independently. f{x) = Pn/G -k (1 - Pn)f{x), and f'{x) is represented as a sum of expo¬ 
nential distributions. 

To obtain robust estimates of N, pn, G, and f{x) when the available starting assembly has 
limited contiguity, we first fixed an estimate of the product Npn, the total number of “noise” pairs 
by tabulating the densities of links (defined as n/hh) for a sample of contig pairs, excluding the 
highest and lowest 1% of densities, and setting Nn = G’^Y^ruj/Y^klj, using the sum of the 
lengths of input contigs as the value of G. We then fit the remaining parameters in Nf{x) by 
least squares to a histogram of observed separations of Chicago read pairs mapped to starting 
assembly contigs, after applying a multiplicative correction factor of G{Yj^=imin{^,li - x))~^ to 
the smoothed counts at separation x. 

Break low-support joins in the input contigs 

To identify and break candidate misjoins in the starting assembly, we used the likelihood model to 
compute the log likelihood change gained by joining the left and right sides of each position i of 
each contig in the starting assembly (i.e. the log likelihood ratio (LLR) L* = InL(g = 0)/L{g = oo) 
for the two contigs that would be created by breaking at position i) When this support fell below a 
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threshold value 4 over a maximal internal segment of an input contig, we defined the segment as 
a “low support” segment. After merging low support segments lying within 300bp of one another, 
and excluding those within 1 Kbp of a contig end, we either (a) introduced a break in the contig 
at the midpoint of the segment, or (b) if the segment is longer than 1000 bp, introduced breaks at 
each end of the segment. 

Contig-contig linking graph construction 

During the assembly process, the Chicago linking data was represented as a graph in which 
(broken) contigs of the starting assembly are nodes and edges are labeled with a list of ordered 
pairs of integers, each representing the positions in the two contigs of the reads from a mapped 
Chicago pair. The initial steps of scaffolding were carried out in parallel on subsets of the data 
created by partitioning the graph into connected components by excluding edges with fewer than 
a threshold ti number of Chicago links, where the lowest integer threshold that did not give rise 
to any connected components that comprised more than 5% of the input contigs. 

Seed scaffold construction 

The iterative phase of scaffold construction was seeded by filtering the edges of the contig-contig 
graph and decomposing it into high-confidence linear subgraphs. First, the contig-contig edges 
were filtered and the minimum spanning forest of the filtered graph was found (see “edge filtering” 
below). The graph was linearized by three successive rounds of removing nodes of degree 1 
followed by removal of nodes with degree greater than 2. Each of the connected components of 
the resulting graph had a linear topology and defined an ordering of a subset of the input contigs. 
The final step in the creation of the initial scaffolds was to find the maximum likelihood choice of 
the contig orientations for each linear component. 

Edge filtering 

The following filters were applied to the edges of the contig-contig graph before linearization. 
Edges from “promiscuous” contigs were excluded. “Promiscuous” contigs were those for which 
the ratio of the degree in the graph of the corresponding node to the contig length in basepairs 
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exceeds tp, or have links with at least links to more than dm other contigs. The thresholds 
tp and dm were selected to exclude approximately 5% of the upper tail of the distribution of the 
corresponding value. 

Contig orienting 

Each input scaffold can have one of two orientations in the final assembly, corresponding to the 
base sequences of the forward and reverse, or "Watson" and "Crick" DNA strands. The optimal 
orientations for the scaffolds in each linear string was found by dynamic programming using the 
following recursion relationship: In an ordered list of scaffolds of length n, the score of the highest- 
scoring sequence of orientation choices for the scaffolds up to scaffold i, such that scaffolds i - k 
to i have particular orientations Oi-k,Oi-k+i, ...Oi is given by: 


j=i-l 

'S'm Oj—fc) Oj—fc+1) • ••) Oj ) — max {Smi^i 1; Oi—l —fc; ..., Oj_l) -|- \ ^ logp(oj,Oj)) 

j=i—i—k 

Including links from contigs k steps back provided a significant improvement in orientation 
accuracy because small intercalated scaffolds might only have linking and therefor orientation 
information on one side, with important orientation information for the flanking scaffolds coming 
from links that jump over it. 

Merge scaffolds within components 

Contig ends were classified as “free” if they lie at the end of a scaffold, or “buried” if they were 
internal to a scaffold. For all pairs of contig ends within each connected component, the LLR 
score for joining them was computed with a "standard" gap size of go- These candidate joins were 
sorted in decreasing order of score and evaluated according to the following criteria. If both ends 
are free and from different scaffolds, test linking the two scaffolds end-to-end. If one end is buried 
and the other is free, and the ends are from different scaffolds, test inserting the scaffold of the 
free end into the gap adjacent into the buried end. If one or both ends is buried and the ends are 
on the same scaffold, test inverting the portion of the scaffold between the two ends. If both ends 
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are buried and from different scaffolds, test all four ways of joining the scaffolds end-to-end. In all 
cases, the possible joins, insertions and inversions were tested by computing the total change in 
LLR score by summing the LLR scores between all pairs of contigs affected by the change. If the 
change increased the LLR score, the best move was accepted. 

Local order and orientation refinement 

To refine both the local ordering and orientations of contigs in each scaffold, a dynamic program¬ 
ming algorithm was applied that slides a window of size w across the ordered and oriented contigs 
of each scaffold. At each position i, all the w\2^ ways of ordering and orienting the contigs within 
the window were considered, and a score representing the optimal ordering and orientation of all 
the contigs up to the end of the current window position that ends with the current O&O of the 
contigs in the window was stored. The scores of all "compatible" O&Os in windows at positions 
i-l,i-2,... i-w, and scores the extension of their orderings with the current O&O were used. 
Since wl2'^ is such a steep function, the method is limited in practice to small values of w. 

Iterative joining 

After the initial scaffolds had been constructed within each connected component, the resulting 
scaffolds were returned to a single pool, and multiple rounds of end-to-end and intercalating scaf¬ 
fold joins were carried out. In each round, all pairs of scaffolds were compared, and likelihood 
scores were computed in parallel for end-to-end and intercalating joins. The candidate joins were 
then sorted and non-conflicting joins were accepted in decreasing order of likelihood score in¬ 
crease. 
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